#
Not long after sending out this tweet on 7 August 2018, Elon Musk was sued by the U.S. Securities and Exchange Commission (SEC) for making “false and misleading statements” and for market manipulation. With the tweet, Musk implied that he had secured an offer to take Tesla private at the stated price, which was substantially above its actual trading price, although no real arrangements had been made and no offer existed. According to the SEC, his tweet “set off a trading frenzy” and pushed Tesla’s stock price up by more than 6 percent, forcing the NASDAQ exchange to halt Tesla trading for 90 minutes until the company gave an official response. The company’s stock price closed at $379.57 on the day of the tweet. Two months later, Musk agreed to a settlement which required him and Tesla to each pay a $20 million fine; in addition, he had to step down as chairman of Tesla’s board.
Elon Musk by Dr. Seuss (Poem)
The SEC said:
“Musk, your tweets are a blight.
They really could cost you your job,
if you don’t stop
all this tweeting at night.”
…Then Musk cried:
“Why? The tweets I wrote are not mean,
I don’t use all-caps and I’m sure that my tweets are clean.”
“But your tweets can move markets
and that’s why we’re sore.
You may be a genius and a billionaire,
but that doesn’t give you the right to be a bore!”
— AI-generated poem by OpenAI, co-founded by Musk
The idea of this R notebook is to introduce everyone interested in data science and machine learning to effective communication of data analysis and statistical findings by leveraging suitable visualisations. Along the way, we also take a look at the evolution of Tesla’s stock price and the influence Elon Musk’s tweets had on it. For visualising analyses and findings, the ggplot2 and plotly packages are used, since they enable producing high-quality, publication-ready visualisations for static as well as dynamic and interactive applications. Both packages are built around the framework of the so-called Grammar of Graphics, a scientific syntax for effective data visualisations, which describes how specific elements or layers of a plot should be separated and classified for a structured approach to visualisations. For more information, see Hadley Wickham (2010) - A Layered Grammar of Graphics and Wilkinson (2011) - The Grammar of Graphics.
I can also highly recommend the following resources:
#
# TODO: Add time of stock split to time series, search for short seller tweets in data, add a log scale plot
# to Tesla's stock price chart, add most recent stock return on distribution, think about colour choice (restrict it)
# Load packages
# TODO: Create automatic package installation for students
library(conflicted)
library(gapminder)
library(httr)
library(rtweet)
library(quantmod)
library(Quandl)
library(pins)
library(tidyverse)
library(lubridate)
library(tsbox)
library(tidytext)
library(DT)
library(ggrepel)
library(plotly)
library(wordcloud2)
# library(viridis)
library(viridisLite)
library(RColorBrewer)
# Conflicted: hierarchy in case of conflict
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("first", "dplyr")
conflict_prefer("last", "dplyr")
conflict_prefer("lag", "dplyr")
conflict_prefer("layout", "plotly")
# Some options for quantmod package
options("getSymbols.warning4.0" = FALSE)
# Color settings
# viridis_pal(n = 10)
palette(viridis(n = 10))
# palette(brewer.pal(n = 11, name = "RdYlGn"))
col_palette_blue <- brewer.pal(n = 9, name = "PuBu")
col_palette_green <- brewer.pal(n = 9, name = "YlGn")
To start with, we get Tesla stock data (ticker = “TSLA”) from Yahoo Finance by using the quantmod package. All that is required to download the data is the ticker of the corresponding financial instrument.
getSymbols(Symbols = "TSLA",
src = "yahoo",
verbose = F)
## [1] "TSLA"
Second, we also get S&P 500 index (SPY ETF) data (ticker = “SPY”) from Yahoo Finance.
getSymbols(Symbols = "SPY",
src = "yahoo",
verbose = F)
## [1] "SPY"
And finally, we download NASDAQ index data (ticker = “^IXIC”) from the same source.
getSymbols(Symbols = "^IXIC",
src = "yahoo",
verbose = F)
## [1] "^IXIC"
Next, we do some data wrangling to transform the Tesla stock data into a tibble with the dplyr and tsbox packages and rename its columns. Tibbles are enhanced data.frames around which the tidyverse packages (and a great many other packages) are built. They provide a standardised way of storing data coming in diverse formats. I also use the pipe operator %>% to make the workflow and the required steps easy to grasp and to adjust later on.
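As a minimal sketch of how the pipe works (using the built-in mtcars data rather than our stock data): x %>% f(y) is equivalent to f(x, y).

```r
library(magrittr)  # provides %>%; also attached via the tidyverse

# The pipe passes the left-hand side as the first argument
# of the function on the right-hand side:
head(mtcars, n = 3)    # classic nested call
mtcars %>% head(n = 3) # identical result, read left to right
```

Chaining several steps this way avoids deeply nested function calls and reads top to bottom, which is why it is used throughout this notebook.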
df_Tesla_stock_data <- TSLA %>%
ts_tbl() %>%
ts_wide() %>%
rename(Date = time,
Open = TSLA.Open,
High = TSLA.High,
Low = TSLA.Low,
Close = TSLA.Close,
Volume = TSLA.Volume,
Adjusted = TSLA.Adjusted)
The Tesla stock data now looks like this, with daily observations for each trading day organised in the rows and seven different variables, also called features in the ML context, in the columns. For each of the 2’575 daily observations, we have the corresponding date in the Date column, the Open (the stock price at the start of trading on the exchange), the daily High and Low prices, the Close at the end of trading, the trading Volume, and finally an Adjusted price, accounting for stock splits, dividends, and similar corporate actions.
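Besides the interactive table, a compact way to check the column types and first values is dplyr’s glimpse(), applied to the tibble we just built:

```r
library(dplyr)

# One row per column: name, type, and the first few values
glimpse(df_Tesla_stock_data)
```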
datatable(df_Tesla_stock_data)
We do the same for the S&P 500 (SPY ETF) index data as well as the NASDAQ index.
df_SPY_data <- SPY %>%
ts_tbl() %>%
ts_wide() %>%
rename(Date = time,
Open = SPY.Open,
High = SPY.High,
Low = SPY.Low,
Close = SPY.Close,
Volume = SPY.Volume,
Adjusted = SPY.Adjusted)
df_NASDAQ_data <- IXIC %>%
ts_tbl() %>%
ts_wide() %>%
rename(Date = time,
OpenNASDAQ = IXIC.Open,
HighNASDAQ = IXIC.High,
LowNASDAQ = IXIC.Low,
CloseNASDAQ = IXIC.Close,
VolumeNASDAQ = IXIC.Volume,
AdjustedNASDAQ = IXIC.Adjusted)
The S&P 500 (SPY ETF) and NASDAQ index series have a few more observations than the Tesla series, i.e. both have data points on 3’453 days. Otherwise, they are in the same format. Here is what the S&P 500 time series looks like:
datatable(df_SPY_data)
Finally, we add all three stock price and index time series together to have them available in a single tibble.
df_Tesla_SPY_NASDAQ <- df_SPY_data %>%
full_join(df_Tesla_stock_data,
by = "Date",
suffix = c("SPY", "TSLA")) %>%
full_join(df_NASDAQ_data,
by = "Date")
In addition, we now compute the (continuous) stock returns for all three series.
df_Tesla_SPY_NASDAQ <- df_Tesla_SPY_NASDAQ %>%
mutate(ReturnsSPY = log(AdjustedSPY) - lag(log(AdjustedSPY)),
ReturnsTSLA = log(AdjustedTSLA) - lag(log(AdjustedTSLA)),
ReturnsNASDAQ = log(AdjustedNASDAQ) - lag(log(AdjustedNASDAQ)))
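As a small aside, with made-up prices (not taken from our data): for small price moves, the continuous (log) return log(P_t) − log(P_{t−1}) is approximately equal to the simple return P_t / P_{t−1} − 1, which is why the two are often used interchangeably for daily data.

```r
prices <- c(100, 101, 99)  # toy price series

log_returns    <- diff(log(prices))                      # continuous returns
simple_returns <- diff(prices) / prices[-length(prices)] # simple returns

# For daily-sized moves of a percent or two, the two are nearly identical
cbind(log_returns, simple_returns)
```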
Next, we scrape tweet data from Elon Musk’s and Tesla’s official Twitter accounts with the rtweet package. Unfortunately, only the roughly 3’200 most recent tweets per user are available, because Twitter limits access to historical data in order to offer it commercially instead. Tweet scraping requires a Twitter account and a developer registration for the free Twitter API. This is fairly easy to set up, however, and should only take a couple of minutes.
df_tweets_elon_musk <- get_timeline("elonmusk", n = 5000)
df_tweets_tesla <- get_timeline("Tesla", n = 5000)
The tweets dataset is rather large, with 90 columns. Thus, only a subset of the columns is shown here to give an idea of what the data set for Elon Musk’s tweets looks like:
df_tweets_elon_musk %>%
select(user_id, created_at, screen_name, text, source,
is_retweet, favorite_count, retweet_count, hashtags) %>%
datatable(filter = "top",
options = list(pageLength = 5,
autoWidth = T))
…and Tesla’s official Twitter account:
df_tweets_tesla %>%
select(user_id, created_at, screen_name, text, source,
is_retweet, favorite_count, retweet_count, hashtags) %>%
datatable(filter = "top",
options = list(pageLength = 5,
autoWidth = T))
Now we’re ready to take the Tesla stock price data and create a basic ggplot2 time series chart. We use the above-mentioned Grammar of Graphics to set up each specific layer of the plot. First, we need to map the data to so-called aesthetics in the plot. Aesthetics are defined within the aes() function in ggplot2 and include plot specifications such as what goes on the x-axis and y-axis, what is shown in which colour, how the size of an object in a plot is determined, and much more. For our basic time series plot, we simply map the Date column from the stock data to the x-axis and the Adjusted stock price to the y-axis. The only additional layer to add to get a finished plot is a so-called geom (short for geometric object). Geoms determine the kind of plot we want to display and are added with the set of geom_... functions. Here, we’d like to create a simple line plot with geom_line(). We add the new layer to the plot with the + operator, set the line geom, and after saving the plot to a new R object we have our first plot.
p_basic_time_series_Tesla <- ggplot(data = df_Tesla_stock_data,
aes(x = Date, y = Adjusted)) + # Close
geom_line()
p_basic_time_series_Tesla
So far, so good. However, the plot doesn’t look particularly great, does it? The grey background is rather distracting, the date on the x-axis is only displayed every five years, it’s unclear in what units the y-axis is shown, and in general, there’s no title or anything to indicate what exactly is shown here. The only information we have is the evolution of the series over a time period of 10 years and its corresponding values on the y-axis. We need to adjust some basic layers of the plot.
For a visual overview and explanations of the different layers in ggplot2’s Grammar of Graphics, see this Towards Data Science article:
ggplot2 Grammar of Graphics
We start by adjusting the scales of the x- and y-axes in a new layer, the scales layer. We take the plot object from above and add scale_x_... and scale_y_... functions with appropriate arguments.
p_basic_time_series_Tesla_w_scales <- p_basic_time_series_Tesla +
scale_x_date(date_breaks = "1 year",
date_labels = "%Y") +
scale_y_continuous(labels = scales::dollar,
breaks = seq(from = 0, to = 1750, by = 250))
p_basic_time_series_Tesla_w_scales
The theme of a plot is yet another layer in the “Grammar of Graphics”. Setting a beautiful theme will help us to get rid of the irritating grey background. Let’s try the theme_classic() function.
p_basic_time_series_Tesla_w_scales_and_theme <- p_basic_time_series_Tesla_w_scales +
theme_classic()
p_basic_time_series_Tesla_w_scales_and_theme
theme_classic() is quite a beautiful and minimalist theme. For interpreting a time series plot, however, a theme including a grid is more appropriate, so in the following plots we use theme_light() instead. We would also like to add a proper title. The plot’s main title and subtitle as well as the axis labels are set with the labs() function. In addition, we accentuate the x- and y-axes by drawing them thicker than the remaining background grid lines. Let’s also adjust the label of the y-axis to make clearer what it represents. Finally, let’s add a caption with the copyright for the plot. Now we have our first complete time series plot.
p_basic_time_series_Tesla_w_scales +
theme_light() +
theme(legend.text = element_text(),
plot.title = element_text(face = "bold"),
axis.line = element_line(size = 0.75)) +
labs(title = "Tesla Stock Price",
subtitle = "Rising Higher and Higher...",
y = "Close (Adjusted)",
caption = "© Data Science & Technology Club HSG")
For the following plots, let’s set a global default ggplot2 theme, instead of adding it manually to each plot.
theme_set(theme_light())
To improve further on our plot, we can add a so-called benchmark to it. A benchmark is, e.g., another time series to compare the Tesla stock price to. We use the previously gathered S&P 500 prices to do exactly that. In order to be able to compare the prices of the two series and to get them into the same y-axis limits, some data wrangling and rebasing is required. While the S&P 500 is a sensible measure of the broad overall U.S. stock market to compare Tesla to, one could argue that Tesla is more of a technology company and thus should rather be compared to the NASDAQ index instead. Hence, we also add the NASDAQ index as a benchmark.
df_Tesla_SPY_NASDAQ <- df_Tesla_SPY_NASDAQ %>%
mutate(AdjustedTSLARebased = AdjustedTSLA / first(df_Tesla_stock_data$Adjusted),
AdjustedSPYRebased = AdjustedSPY / first(df_SPY_data$Adjusted),
AdjustedNASDAQRebased = AdjustedNASDAQ / first(df_NASDAQ_data$AdjustedNASDAQ))
p_time_series_Tesla_vs_SPY <- df_Tesla_SPY_NASDAQ %>%
ggplot(aes(x = Date)) +
geom_line(aes(y = AdjustedTSLARebased), col = palette()[4]) +
geom_point(aes(x = last(Date),
y = last(AdjustedTSLARebased)),
col = palette()[4],
shape = 1,
size = 1.5) +
geom_text(label = "TSLA",
aes(x = last(Date),
y = last(AdjustedTSLARebased)),
color = palette()[4],
hjust = 1.4,
vjust = -1) +
geom_line(aes(y = AdjustedSPYRebased), col = col_palette_green[7]) +
geom_point(aes(x = last(Date),
y = last(AdjustedSPYRebased)),
col = col_palette_green[7],
shape = 1,
size = 1.5) +
geom_text(label = "S&P 500",
aes(x = last(Date),
y = last(AdjustedSPYRebased)),
color = col_palette_green[7],
hjust = 1.4,
vjust = -1) +
geom_line(aes(y = AdjustedNASDAQRebased), col = col_palette_green[9]) +
geom_point(aes(x = last(Date),
y = last(AdjustedNASDAQRebased)),
col = col_palette_green[9],
shape = 1,
size = 1.5) +
geom_text(label = "NASDAQ",
aes(x = last(Date),
y = last(AdjustedNASDAQRebased)),
color = col_palette_green[9],
hjust = 1.4,
vjust = -2) +
scale_x_date(date_breaks = "1 year",
date_labels = "%Y") +
scale_y_continuous(labels = scales::percent,
breaks = seq(from = 0, to = 110, by = 10)) +
labs(title = "Is Tesla's Stock Price an Inflated Bubble, Close to Bursting?",
subtitle = "Tesla's Stock Price vs. S&P 500 Benchmark",
y = "Price Rebased (%)",
caption = "© Data Science & Technology Club HSG") +
theme(legend.text = element_text(),
plot.title = element_text(face = "bold"),
axis.line = element_line(size = 0.75))
p_time_series_Tesla_vs_SPY
## Warning: Removed 878 row(s) containing missing values (geom_path).
It is quite impressive by how much Tesla’s stock price outperforms the (already well-performing) S&P 500. Beginning in mid-October 2019 in particular, the volatility of the stock increases immensely: a sharp rise is followed by a sharp decline and then another sharp rise. It remains questionable whether Tesla’s recent stock price appreciation is sustainable and warranted in the long run. Let’s highlight the period during which Tesla’s stock price increase was most notable. We can do this with the annotate geom. Highlighting areas or specific parts of a chart is a useful element of storytelling with data.
p_time_series_Tesla_vs_SPY +
annotate(geom = "rect",
xmin = as.Date("2019-10-15"),
xmax = last(df_Tesla_SPY_NASDAQ$Date) + 30,
ymin = -Inf,
ymax = Inf,
col = "grey",
alpha = 0.25) +
annotate(geom = "text",
label = "High Volatility Period",
x = as.Date("2020-04-01"),
y = -3)
## Warning: Removed 878 row(s) containing missing values (geom_path).
Next, we turn to one of the most basic, but also most useful plots - the scatter plot. First, however, we compute the mean returns for the SPY and Tesla.
df_Tesla_SPY_NASDAQ_avg <- df_Tesla_SPY_NASDAQ %>%
summarise(SPY_mean = mean(ReturnsSPY, na.rm = T),
TSLA_mean = mean(ReturnsTSLA, na.rm = T))
We use geom_jitter() instead of geom_point() since it slightly and randomly displaces individual observations in order to avoid overplotting, making the individual points more visible. Returns of the SPY go on the x-axis and returns of Tesla on the y-axis. We also highlight the most recent return, to see where it stands in comparison to historical returns. The if_else() function is pretty handy for this purpose.
p_scatter_Tesla_SPY <- df_Tesla_SPY_NASDAQ %>%
ggplot(aes(x = ReturnsSPY, y = ReturnsTSLA)) +
geom_jitter(aes(col = if_else(Date == max(Date, na.rm = T), "Today", "Historical")),
alpha = 0.5) + # geom_point()
# geom_vline(xintercept = df_Tesla_SPY_NASDAQ_avg$SPY_mean) +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
scale_color_manual(name = "Date",
values = c(col_palette_blue[6], "red")) +
labs(title = "Scatter Plot",
subtitle = "SPY vs. TSLA Returns",
x = "SPY Returns (Continuous)",
y = "TSLA Returns (Continuous)",
caption = "© Data Science & Technology Club HSG") +
theme(legend.text = element_text(),
plot.title = element_text(face = "bold"),
axis.line = element_line(size = 0.75))
p_scatter_Tesla_SPY
## Warning: Removed 879 rows containing missing values (geom_point).
Scatter plots are great for analysing the relationship between two (continuous) variables and are probably the most-used charts in research and ML contexts. To check whether a linear relationship between the returns of the SPY and Tesla exists, we can additionally add a regression line with geom_smooth(). The method argument is set to lm for linear model.
p_scatter_Tesla_SPY +
geom_smooth(method = "lm",
col = "red")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 879 rows containing non-finite values (stat_smooth).
## Warning: Removed 879 rows containing missing values (geom_point).
By looking at the scatter plot and the dispersion of points, however, it is doubtful whether the relationship is truly linear. Thus, we can try to set another model, such as loess (local polynomial regression fitting), in geom_smooth().
p_scatter_Tesla_SPY +
geom_smooth(method = "loess",
col = "red")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 879 rows containing non-finite values (stat_smooth).
## Warning: Removed 879 rows containing missing values (geom_point).
Did you notice how the regression line immediately became the centre of our attention? This is due to the colour it’s mapped to, relative to the other elements in the plot. Ideally, we use colour sparingly, to highlight specific and particularly important aspects of our visualisations.
Getting back to the relationship between SPY and Tesla returns: from these plots alone, it remains unclear what the true relationship between the returns is. All we can say is that Tesla, on average, seems to perform better when the U.S. stock market also performs well. However, the more extreme the returns are, the more uncertainty there is about the relationship, as indicated by the wider confidence intervals. This is due to the comparatively few observations we have for extreme returns.
#
# TODO: Highlight largest bar(s) with colour and alpha
# Create bar plot of TSLA stock volume
p_bar_Tesla_stock_volume <- df_Tesla_stock_data %>%
ggplot(aes(x = Date, y = Volume)) +
geom_col() +
labs(title = "Tesla Trading Volume - ...While Trading Volume Remained Constant Over Time",
caption = "© Data Science & Technology Club HSG") +
scale_x_date(date_breaks = "1 year",
date_labels = "%Y") +
scale_y_continuous(labels = scales::comma, # trading volume is a number of shares, not dollars
breaks = seq(from = 0, to = max(df_Tesla_stock_data$Volume), by = 50e6))
p_bar_Tesla_stock_volume
We can play with the width argument in geom_col() to adjust the width of the bars plotted.
p_bar_Tesla_stock_volume <- df_Tesla_stock_data %>%
ggplot(aes(x = Date, y = Volume)) +
geom_col(width = 0.2) +
labs(title = "Tesla Trading Volume - ...While Trading Volume Remained Constant Over Time",
caption = "© Data Science & Technology Club HSG") +
scale_x_date(date_breaks = "1 year",
date_labels = "%Y") +
scale_y_continuous(labels = scales::comma, # trading volume is a number of shares, not dollars
breaks = seq(from = 0, to = max(df_Tesla_stock_data$Volume), by = 50e6))
p_bar_Tesla_stock_volume
First, we create a histogram to visualise the distribution of Tesla’s stock returns over time.
p_hist_Tesla <- df_Tesla_SPY_NASDAQ %>%
ggplot(aes(x = ReturnsTSLA)) +
geom_histogram(bins = 500,
col = col_palette_blue[6],
alpha = 0.5) +
labs(title = "Histogram",
subtitle = "Tesla Stock Returns",
x = "Continuous Returns",
y = "Count") +
scale_x_continuous(label = scales::percent) +
theme(legend.text = element_text(),
plot.title = element_text(face = "bold"),
axis.line = element_line(size = 0.75))
p_hist_Tesla
## Warning: Removed 879 rows containing non-finite values (stat_bin).
Then, we add a density to the distribution.
p_hist_Tesla <- p_hist_Tesla +
geom_density(kernel = "gaussian",
col = "red")
p_hist_Tesla
## Warning: Removed 879 rows containing non-finite values (stat_bin).
## Warning: Removed 879 rows containing non-finite values (stat_density).
Next, we add the mean and median return.
Tesla_returns_mean <- mean(df_Tesla_SPY_NASDAQ$ReturnsTSLA, na.rm = T)
Tesla_returns_median <- median(df_Tesla_SPY_NASDAQ$ReturnsTSLA, na.rm = T)
p_hist_Tesla +
geom_vline(xintercept = Tesla_returns_mean,
col = palette()[1]) +
geom_vline(xintercept = Tesla_returns_median,
col = palette()[8])
## Warning: Removed 879 rows containing non-finite values (stat_bin).
## Warning: Removed 879 rows containing non-finite values (stat_density).
# Determine y-axis density position of median, mean, and confidence intervals
p_hist_Tesla <- df_Tesla_SPY_NASDAQ %>%
ggplot(aes(x = ReturnsTSLA)) +
stat_density(aes(y = ..scaled..),
geom = "line",
size = 0.5,
col = col_palette_blue[6],
adjust = 1) +
labs(title = "Density - Tesla Stock Returns",
x = "Continuous Returns",
y = "Density (Scaled)") +
scale_x_continuous(label = scales::percent) +
theme(legend.text = element_text(),
plot.title = element_text(face = "bold"),
axis.line = element_line(size = 0.75))
mean_se <- sd(df_Tesla_SPY_NASDAQ$ReturnsTSLA, na.rm = T) / sqrt(sum(!is.na(df_Tesla_SPY_NASDAQ$ReturnsTSLA))) # count only non-missing returns
mean_conf_inter_l <- Tesla_returns_mean - 1.96 * mean_se
mean_conf_inter_u <- Tesla_returns_mean + 1.96 * mean_se
mean_pos_y <- ggplot_build(p_hist_Tesla)$data[[1]] %>%
slice(which.min(abs(x - Tesla_returns_mean))) %>%
pull(ndensity)
## Warning: Removed 879 rows containing non-finite values (stat_density).
mean_conf_inter_l_pos_y <- ggplot_build(p_hist_Tesla)$data[[1]] %>%
slice(which.min(abs(x - mean_conf_inter_l))) %>%
pull(ndensity)
## Warning: Removed 879 rows containing non-finite values (stat_density).
mean_conf_inter_u_pos_y <- ggplot_build(p_hist_Tesla)$data[[1]] %>%
slice(which.min(abs(x - mean_conf_inter_u))) %>%
pull(ndensity)
## Warning: Removed 879 rows containing non-finite values (stat_density).
p_hist_Tesla +
geom_segment(x = Tesla_returns_mean,
xend = Tesla_returns_mean,
y = 0,
yend = mean_pos_y,
linetype = "solid",
color = col_palette_blue[6],
size = 0.4) +
geom_point(x = Tesla_returns_mean,
y = mean_pos_y,
col = col_palette_blue[6])
## Warning: Removed 879 rows containing non-finite values (stat_density).
# geom_area(x = mean_conf_inter_l,
# xend = mean_conf_inter_u,
# y = mean_conf_inter_l_pos_y,
# yend = mean_conf_inter_u_pos_y,
# linetype = "solid",
# color = "grey",
# size = 0.4)
If we want to display multiple series in a single plot, this is best done by using the ggplot2 facets layer. It is applied as a separate layer in our already existing time series plot. First, however, some data wrangling is required to transform the data from wide to long format.
df_Tesla_stock_data_long <- df_Tesla_stock_data %>%
select(-Volume) %>%
pivot_longer(cols = -Date,
names_to = "Variable",
values_to = "Values")
p_time_series_Tesla_faceted <- df_Tesla_stock_data_long %>%
ggplot(aes(x = Date, y = Values, col = Variable)) +
geom_line() +
facet_wrap(. ~ Variable) +
scale_x_date(date_breaks = "1 year",
date_labels = "%Y") +
scale_y_continuous(labels = scales::dollar,
breaks = seq(from = 0, to = 1750, by = 250)) +
scale_color_viridis_d() + # ggplot2's built-in viridis scale (the viridis package is not attached)
labs(title = "Faceted Stock Price Time Series - Tesla",
y = "Stock Price",
caption = "© Data Science & Technology Club HSG") +
theme(axis.text.x = element_text(angle = 60,
vjust = 0.5))
p_time_series_Tesla_faceted
To add some more spice to the previously built plots, we can turn them into interactive web graphics. This is where the plotly package comes into play. It is built on the plotly.js (JavaScript) library and is extremely useful and versatile when it comes to interactive plots used in reports, dashboards or web pages.
p_time_series_Tesla_faceted %>%
ggplotly()
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# FIXME: Annotation doesn't work yet
p_time_series_Tesla_vs_SPY <- p_time_series_Tesla_vs_SPY %>%
ggplotly() %>%
layout(annotations = list(x = 1,
y = 1,
text = "© Data Science & Technology Club HSG"))
p_time_series_Tesla_vs_SPY
We start by computing the number of tweets Elon Musk writes per day and show summary statistics of those.
df_tweets_elon_musk_per_day <- df_tweets_elon_musk %>%
mutate(Date = as.Date(created_at)) %>%
group_by(Date) %>%
summarise(TweetsN = n())
## `summarise()` ungrouping output (override with `.groups` argument)
df_tweets_elon_musk_per_day %>%
summarise(Min = min(TweetsN, na.rm = T),
`1st Quartile` = quantile(TweetsN, probs = 0.25),
Median = median(TweetsN, na.rm = T),
Mean = round(mean(TweetsN, na.rm = T), digits = 2),
`3rd Quartile` = quantile(TweetsN, probs = 0.75),
Max = max(TweetsN, na.rm = T)) %>%
datatable(caption = htmltools::tags$caption(style = "caption-side: bottom; text-align: center;",
"Table 1: ",
htmltools::em("Summary statistics of daily tweets by Elon Musk.")))
Next, we create a bar plot to visualise the number of tweets per day.
p_bar_tweets_elon_musk <- df_tweets_elon_musk_per_day %>%
ggplot(aes(x = Date, y = TweetsN, fill = TweetsN)) +
geom_col() +
scale_x_date(date_breaks = "1 month",
date_labels = "%Y %b") +
scale_y_continuous(breaks = seq(0, 60, 10)) +
labs(title = "Tweets by Elon Musk",
x = "Month",
y = "Number of Tweets") +
scale_fill_binned(type = "viridis") +
theme(axis.text.x = element_text(angle = 60,
hjust = 1))
p_bar_tweets_elon_musk <- p_bar_tweets_elon_musk %>%
ggplotly()
p_bar_tweets_elon_musk
We can compare this to the evolution of Tesla’s stock price.
subplot(p_time_series_Tesla_vs_SPY,
p_bar_tweets_elon_musk,
nrows = 2,
shareX = T)
Let’s see whether the number of tweets Elon Musk sends per day is associated in any way with the returns of Tesla’s stock. We naively try to do this with a scatter plot first.
df_Tesla_EM_tweets <- df_Tesla_SPY_NASDAQ %>%
full_join(df_tweets_elon_musk_per_day,
by = "Date") %>%
select(Date, ReturnsTSLA, TweetsN)
p_scatter_Tesla_EM_tweets <- df_Tesla_EM_tweets %>%
ggplot(aes(x = ReturnsTSLA, y = TweetsN)) +
geom_jitter(col = col_palette_blue[6],
alpha = 0.5) +
geom_vline(xintercept = 0,
size = 1,
alpha = 0.1) +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(breaks = seq(0, max(df_tweets_elon_musk_per_day$TweetsN, na.rm = T), 10)) +
labs(title = "Number of Daily Tweets vs. TSLA Returns",
subtitle = "Scatter Plot",
x = "TSLA Returns (Continuous)",
y = "Number of Tweets per Day",
caption = "© Data Science & Technology Club HSG") +
theme(legend.text = element_text(),
plot.title = element_text(face = "bold"),
axis.line = element_line(size = 0.75))
p_scatter_Tesla_EM_tweets %>%
ggplotly()
We can also add a (local polynomial) regression line to check for an association. However, as we already suspected from the simple scatter plot, there is no direct relationship visible here.
p_scatter_Tesla_EM_tweets_reg <- p_scatter_Tesla_EM_tweets +
geom_smooth(method = "loess",
col = "red")
p_scatter_Tesla_EM_tweets_reg
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 3328 rows containing non-finite values (stat_smooth).
## Warning: Removed 3328 rows containing missing values (geom_point).
Hence, we next produce a boxplot with the same underlying data as before. To do this, we need to sort the number of tweets into so-called “bins”. We choose 12 bins, thus splitting the number of tweets into bin widths of approximately 5.
p_boxplot_Tesla_EM_tweets <- df_Tesla_EM_tweets %>%
mutate(TweetsN = cut(TweetsN, breaks = 12)) %>%
filter_all(all_vars(!is.na(.))) %>% # drop rows with any missing value
ggplot(aes(x = TweetsN, y = ReturnsTSLA)) +
geom_boxplot(col = col_palette_blue[6]) +
scale_y_continuous(labels = scales::percent) +
scale_color_viridis_d() +
labs(title = "Number of Daily Tweets vs. TSLA Returns",
subtitle = "Boxplot",
x = "Number of Tweets per Day",
y = "TSLA Returns (Continuous)",
caption = "© Data Science & Technology Club HSG") +
theme(legend.text = element_text(),
plot.title = element_text(face = "bold"),
axis.line = element_line(size = 0.75),
axis.text.x = element_text(angle = 90,
vjust = 1))
p_boxplot_Tesla_EM_tweets %>%
ggplotly()
We conclude that there is no clear association between the number of tweets per day and stock returns, as there is no rising trend across the binned box plots (note that the plot axes are inverted relative to the scatter plot). Our naive comparison thus shows no association.
# Get Tesla tweets
df_tweets_elon_musk_Tesla <- df_tweets_elon_musk %>%
filter(str_detect(text, pattern = "Tesla"))
So now, let’s dive deeper and take a look at Musk’s infamous “taking-Tesla-private” tweet from 7 August 2018.
p_time_series_Tesla_private_tweet <- df_Tesla_stock_data %>%
filter(between(Date,
as.Date("2018-07-01"),
as.Date("2018-09-14"))) %>%
mutate(Date = as_datetime(Date, tz = "UTC")) %>%
ggplot(aes(x = Date, y = Adjusted)) +
geom_line(col = col_palette_blue[6]) +
geom_point(col = col_palette_blue[6]) +
geom_vline(xintercept = as_datetime("2018-08-07 12:48:00"),
col = "red",
alpha = 1)
# annotate(geom = "text",
# label = "High Volatility Period",
# x = as.Date("2020-04-01"),
# y = -3)
p_time_series_Tesla_private_tweet
# p_time_series_Tesla_private_tweet %>%
# ggplotly()
We next tokenize the tweets into words. This enables us to quantitatively analyse them.
df_tweets_elon_musk <- df_tweets_elon_musk %>%
unnest_tokens(output = words,
input = text,
token = "words")
When we count which words appear most often in the tweets, we see that they are common ones such as “to”, “the”, etc. These are known as stop words and it makes sense to remove them for a meaningful analysis.
df_tweets_elon_musk %>%
count(words) %>%
arrange(desc(n)) %>%
datatable()
Let’s do just that, and voilà - the most frequently used words in the tweets now make much more sense, and we can actually start using them for further analysis.
stop_words_custom <- tribble(~word, ~lexicon,
"http", "CUSTOM",
"https", "CUSTOM",
"t.co", "CUSTOM")
stop_words_final <- stop_words %>%
bind_rows(stop_words_custom)
df_tweets_elon_musk_cleaned <- df_tweets_elon_musk %>%
anti_join(stop_words_final,
by = c("words" = "word"))
df_tweets_elon_musk_cleaned %>%
count(words) %>%
arrange(desc(n)) %>%
datatable()
We visualise the number of times the words occur in Musk’s tweets first with a simple flipped bar plot.
df_tweets_elon_musk_cleaned %>%
count(words) %>%
filter(n >= 70) %>%
ggplot(aes(x = fct_reorder(words, n), y = n)) +
geom_col(aes(fill = if_else(str_detect(words, pattern = "tesla"),
"red",
"blue"))) +
coord_flip() +
scale_fill_manual(values = c("red" = "red", "blue" = col_palette_blue[6])) +
labs(title = "Tesla Seems Indeed to be Important for Elon Musk… (It's All He Talks about All-Day Long!)",
subtitle = "Word Counts in Elon Musk's Tweets",
x = "Word",
y = "Word Counts",
caption = "© Data Science & Technology Club HSG") +
theme(legend.position = "none",
plot.title = element_text(face = "bold"),
axis.line = element_line(size = 0.75))
# TODO
# wordcloud2()
To perform a sentiment analysis on the content of Elon Musk’s tweets, we match the tweets with the sentiment dictionary nrc and visualise the results.
df_tweets_elon_musk_sentiment_nrc <- df_tweets_elon_musk_cleaned %>%
inner_join(get_sentiments("nrc"),
by = c("words" = "word"))
p_bar_tweets_elon_musk_sentiment_nrc <- df_tweets_elon_musk_sentiment_nrc %>%
count(sentiment) %>%
arrange(desc(n)) %>%
mutate(colour = if_else(sentiment %in% c("positive", "trust", "anticipation", "joy"),
"green",
"red")) %>%
ggplot(aes(x = fct_reorder(sentiment, n), y = n, fill = colour)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = c("red" = "red", "green" = col_palette_green[7])) +
labs(title = "Musk Seems to Look on the Bright Side of Life…",
subtitle = "Sentiment Analysis of Elon Musk's Tweets",
x = "Sentiment",
y = "Word Counts",
caption = "© Data Science & Technology Club HSG") +
theme(legend.position = "none",
plot.title = element_text(face = "bold"),
axis.line = element_line(size = 0.75))
p_bar_tweets_elon_musk_sentiment_nrc %>%
ggplotly()
Let’s check how accurately we matched words in the tweets to specific sentiments.
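One informal way to do this (a sketch based on the objects built above, not part of the original analysis) is to sample a few matched word-sentiment pairs per category and eyeball whether the assignments look plausible:

```r
library(dplyr)
library(DT)

# Draw a handful of random word-sentiment pairs per sentiment category
# (slice_sample() requires dplyr >= 1.0.0)
df_tweets_elon_musk_sentiment_nrc %>%
  group_by(sentiment) %>%
  slice_sample(n = 5) %>%
  ungroup() %>%
  select(sentiment, words) %>%
  datatable()
```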
#
# Candlestick chart
df_Tesla_stock_data %>%
plot_ly(x = ~ Date,
type = "candlestick",
open = ~ Open,
close = ~ Close,
high = ~ High,
low = ~ Low) %>%
layout(title = "Candlestick Chart - Tesla Stock Price")
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
df_Tesla_stock_data %>%
filter(Date >= "2020-01-01") %>%
plot_ly(x = ~ Date,
type = "candlestick",
open = ~ Open,
close = ~ Close,
high = ~ High,
low = ~ Low) %>%
layout(title = "Candlestick Chart - Tesla Stock Price")
# OHLC chart
# TODO: Add nicer colors
p_LC <- df_Tesla_stock_data %>%
ggplot(aes(x = Date, y = Adjusted)) +
geom_line(size = 1) +
geom_line(aes(y = Low),
col = palette()[1],
linetype = "dashed") +
geom_line(aes(y = High),
col = palette()[8],
linetype = "dashed") +
geom_ribbon(aes(ymin = Low,
ymax = High),
alpha = 0.4) +
labs(title = "Tesla Stock Price with Daily High-Low Range") +
scale_x_date(date_breaks = "1 year",
date_labels = "%Y") +
scale_y_continuous(labels = scales::dollar)
p_LC %>%
ggplotly()
# Simple animated plot with plotly and the gapminder data
gapminder %>%
filter(country %in% c("China", "United States", "United Kingdom", "India",
"Germany", "Switzerland", "Austria", "Japan", "Singapore")) %>%
plot_ly(x = ~ lifeExp,
y = ~ gdpPercap,
size = ~ pop,
color = ~ country,
frame = ~ year,
type = "scatter",
mode = "markers",
colors = palette())